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Abstract 

Due to its universality, swarm behavior in nature attracts much attention of scientists from many fields. Fish schools are 
examples of biological communities that demonstrate swarm behavior. The detection and tracking offish in a school are of 
important significance for the quantitative research on swarm behavior. However, different from other biological 
communities, there are three problems in the detection and tracking offish school, that is, variable appearances, complex 
motion and frequent occlusion. To solve these problems, we propose an effective method offish detection and tracking. In 
this method, first, the fish head region is positioned through extremum detection and ellipse fitting; second, The Kalman 
filtering and feature matching are used to track the target in complex motion; finally, according to the feature information 
obtained by the detection and tracking, the tracking problems caused by frequent occlusion are processed through 
trajectory linking. We apply this method to track swimming fish school of different densities. The experimental results show 
that the proposed method is both accurate and reliable. 
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Introduction 

There has been growing research interest in animal collective 
behavior due to its high scientific values and a wide range of 
potential applications [1-3]. From biological perspective, the study 
of swarm behavior in animals can provide us with a better 
understanding of how animals evolve. In computer science, there 
are techniques being used such as particle swarm optimization and 
ant colony optimization that use these social interactions to solve 
optimization problems. In engineering, the study of swarm 
behavior has been used to create groups of robots that are capable 
of interacting and working together. 

Fish school is one of the most common biological swarms in 
nature. The schooling fish often swim in various shapes. This 
behavior is either spontaneous or for resisting attacks. But what is 
the principle behind the movement? How do fish schools benefit 
from these movements to survive? How could we get revelation of 
bionic algorithm from schooling (artificial fish swarm algorithm)? 
These problems have been intriguing many scientists, especially 
biologists, physicists and computer scientists. Since the 1970's, in 
an attempt to answer these questions, there are already researchers 
from different fields, who have begun to research by quantitative 
analysis [4—10], but because of the variability offish motion and 
the complexity of their environment, currently, the study of fish 
school behavior is still challenging. 

The most informative way to study schooling behavior and to 
discover underlying principles is through acquiring and quantita- 
tively analyzing the motion data of the fish school [11-16]. While 
manual analyses of collective motion is tedious, time-consuming 
and sometimes even impossible, video-tracking technology helps 
rapid and objective quantification of collective motion. As the 



rapid development of image acquisition devices and video tracking 
methods, it has become possible to measure the trajectory of each 
individual in a large group. 

When the schooling behavior is studied in a laboratory 
environment, a common experiment setup is to place a video 
camera vertically on top of a fish tank filled with shallow water as 
shown in Figure 1(a) for which the swimming motion can be 
approximately considered as a movement on a two-dimensional 
plane. While the problem of detection and tracking offish school is 
related to the multi-target detection and tracking problem in the 
field of computer vision and pattern recognition, it has strong 
unique characteristics making it challenging and worth thorough 
investigation. In details, we are faced with the following two 
difficulties. 

(1) Detection problem: first, the shape of the fish is non rigid, 
the outline can't be represented by one or several 
templates; second, the fishes' texture information in the 
video image are less to detect and its location can't be 
detected effectively only by texture features; finally, when 
the fish density is large, the targets will frequently occlude 
each other in the image. With the current target detection 
algorithm, it is difficult to detect the location of each target. 

(2) Tracking problem: first, the motion of fish swimming is so 
complex that the existing models cannot fully simulate; 
second, due to the higher degree of similarity among the 
fish, the use of a single feature method can hardly 
distinguish between different targets; finally, the detecting 
errors caused by fish occlusion will lead to a fragmentation 
in trajectory, adding more difficulties in tracking. 
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Figure 1. (a) Experiment environment; (b) Fish shape model. 

doi:1 0.1 371 /journal.pone.01 06506.g001 



In order to overcome the above mentioned difficulties, we 
propose an effective method for tracking a large number of fish, 
which has the following advantages. 

(1) Based on the analysis of the fish shape, we propose a fish 
head region detection method that integrates local feature 
and geometric feature. First, the extremum detection of the 
entire image is implemented based on the gray distribution 
of the fish head region, and then according to the shape 
characteristics and contrast characteristics of the fish head 
region, the ellipse fitting and parameter estimation are 
conducted to the detected adjacent region of the extreme 
points, and the fish head region is further positioned 
accurately. This method comprehensively applies various 
features of the fish head appearance. It can accurately 
locate the fish head regions under different environments 
with fast computation and strong ability of anti-interfer- 
ence; second, the method simplifies the fish detection 
problem into the detection of one point and its adjacent 
region. Only by detecting partial shape information can we 
determine the fish' position and better deal with the 
occlusion problem in fish school detection. 

(2) According to the detected target position, we first use 
Kalman filter to estimate its motion state, then apply 
feature matching method to realize data association. For 
targets failing to be predicted, we establish a compensation 



window to deal with that. Finally, in order to solve 
trajectory fragmentation caused by occlusion, we propose 
to use time and space information of trajectory fragments 
to complete the trajectory linking. The proposed tracking 
method takes full advantage of a variety of features and 
information in swimming fish school, able to deal with the 
complex motion of fish school and tracking problems 
caused by frequent occlusion effectively. Besides, with a low 
computational complexity, the method is more efficient in 
tracking targets in a large population. 

Using the proposed method, we have successfully tracked the 
motion trajectories offish schools of different densities. In order to 
measure the performance of the method, we compare the tracked 
results with the ground truth obtained by manual tracking. The 
result shows that the proposed method is accurate and robust. 

The Proposed Detection and Tracking Method 

From Figure 1(b), we observe that the fish appears in a top view 
image as consisting of two parts: a rigid anterior part and a 
deformable posterior part that may swing to propel it. Since the 
fish head is rigid and its shape and grayscale distribution keep 
almost constant as the fish swims, an effective way to track fish in 
video is to find the fish head regions and associate them for each 
frame. Following this line of reasoning, the proposed method is 
designed to consist of several steps to track fish school. The first 
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Figure 2. The diagram of the proposed method. 
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Figure 3. The illustration of the ellipse parameters, (a) The raw image of the fish head region; (b) (x 0 ,y 0 ) denotes the extreme point and the 
violet color shows the grayscale distribution of the extreme point region. Variables length, width and 0 represent the long axis, short axis and angle of 
the fitted ellipse respectively; (c) The variable contrast in the direction of the z-axis represents the contrast change of the ellipse and its surrounding 
region. 

doi:1 0.1 371 /journal.pone.01 06506.g003 



step is to detect the fish head from the video frames by using blob 
detection and ellipse fitting. The second step is to filter its motion 
state vector and predict its next value and utilize the predicted 
position together with feature matching results to associate the 
detected head regions of two consecutive frames. The final step is 
to deal with the possible fragmentation of the trajectories caused 
by occlusion via an effective trajectory linking method. An 
overview of the proposed method is shown in Figure 2. The 
following describes each step in detail. 

2.1 Ethics statement 

All experimental procedures were in compliance with the 
Institutional Animal Care and Use Committee (IACUC) of 
Shanghai Research Center for Model Organisms (Shanghai, 
China) with approval ID 2010-0010, and all efforts were made to 
minimize suffering. This study was approved by the Institutional 
Animal Care and Use Committee (IACUC), and written informed 
consent was obtained. 



2.2 Target detection 

For each frame image of the video sequence, we first detect the 
fish head regions. The detection method consists of two parts: 
scale-space DoH blob detection, ellipse fitting and constraint. 

2.2.1 Scale-space DoH blob detection. From Figure 1(b), 
we can see that the pixels inside the fish head region are 
considerably darker than the background ones, the head region is 
partially elliptical, and its width is greater than the rest part of the 
fish body. These characteristics show that fish head region appears 
like a blob and we first use blob detection to find the fish head 
region. In scale-space, it is an effective method to detect image 
blobs by using Determinant of Hessian (DoH) [17,18], which 
reflects the local structural information of the image, and can 
better detect blobs of different scales in the image and well 
suppress the slender blobs in the image. The basic idea of scale 
space is: by introducing a scale parameter in image information 
processing model to obtain visual information at different scales 
through the continuous variation of the scale parameter; then, 
explore the substantive characteristics of image by combining all 
the information. The method of scale space absorbs the traditional 
single-scale visual information to the constantly changing dynamic 




Figure 4. Candidate constraints based on width, contrast and angle. 

doi:1 0.1 371 /journal.pone.01 06506.g004 



PLOS ONE | www.plosone.org 



3 



September 2014 | Volume 9 | Issue 9 | e1 06506 



Automatic Detection and Tracking of Fish School in Shallow Water 



a 





Figure 5. (a) The model of compensation window; (b) The segmentation model of matching region. 

doi:10.1371/journal.pone.0106506.g005 



analysis so as to obtain substantive characteristics of image more 
easily. It has been proved that scale space can be created by 
convolving the image with Gaussian kernel function [19]. The 
Gaussian kernel function has several good properties such as 
linearity, symmetry, separateness and so on, making it the kernel 
function best for the expression of scale space. 

Suppose the pixel point is (x,y,s) in scale-space, where x,y are the 
point's coordinates, s is the scale of the point, the Hessian matrix of 
the point is defined as: 



AM(x,y,s) = (L xx x L vy — L xy ) x s 



(2) 



Then the result in the blob detection is the extreme point of DoH 
responses for the position space and scale space: 



(xoJWo) = ar § nunlocal( w )(AAf(^,s)) 



(3) 



M(x,y,s) = 



Lxx L X y 



(1) 



where L„, L. 



L xy are the convolution results of the Gaussian 



second order derivative and the point (x,y) at scale s respectively. 
The DoH of the matrix can be expressed as: 



Since the head region gray value is less than the background 
region, we retain only the minimum extreme point. To improve 
the accuracy of the extreme point, we use tri-linear interpolation 
method to calculate the related parameters (coordinates and scale) 
of each extreme point. Tri-linear interpolation is a method of 
multivariate interpolation on a three-dimensional regular grid. It 
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Figure 6. The process of feature matching. 
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Figure 7. Trajectory linking based on time and distance. 

doi:10.1371/journal.pone.0106506.g007 



approximates the parameters of an extreme point within the local 
axial rectangular prism linearly, using data on the lattice points. 

When scale-space DoH blob detection is carried out in the 
image, there always exists a stable extreme point at the center of 
the fish head region, which provides a reliable basis for the 
positioning of the head region. However, apart from the fish head 
regions, there may also exist extreme points in other regions. 
Therefore, it is necessary to pick up the extreme points that 
corresponding to the fish head regions out of all extreme points. In 
order to solve this problem, we fit ellipse for each extreme point 
according to the grayscale change of the extreme point region. If 
the extreme point is located in the center of the head region, the 
fitted ellipse can most reflect the characteristics of the head region. 

2.2.2 Ellipse fitting and constraint. There is a correspond- 
ing relationship between the Hessian matrix of the extreme point 
detected in the previous step and the second-order derivative 
matrix. The second-order derivative matrix can also be called the 
autocorrelation matrix, and its eigenvalues can represent the 
curvature of the orthogonal direction, and the change of curvature 
can reflect anisotropy degree of the regional structure. Based on 
this characteristic, we fit ellipse by using the second-order 
derivative matrix of the extreme point, estimate the grayscale 
variance of the extreme point region and finally find the location 
of the fish head. 

Let the second-order derivative matrix (Hessian matrix) 
corresponding to the extreme point (x 0 , Jo, s 0 ) be M(x 0 , y 0 , s 0 ). 
Then the eigenvalues and eigenvectors of the matrix can be 
expressed as: 



h ={L X(jX(s +Ly m + ^ (L X(jX0 - L ym f + 4L 2 xm )/2 
h = (L yo y 0 +Ly m - sj (L XQXQ -L ym f + ALl m )/2 



(4) 



= (o£ LYo ,ai,, 0 ) r = 




La '0->'0 




0-\-Ly 0 y Q ) 2 +Ll m , 




= (X2x 0 A2y Q ) T = 




Lx 0>'0 




(; '2- Z ->'0->'0 )2+L i()>'0 1 





(5) 



where the eigenvectors a u cc 2 correspond to the eigenvalues X\ 
and X 2 respectively (|A 1 |>|A 2 |). The ratio of the eigenvalues is 
defined as r = X\/X2. 



Let the calculated extreme point (xo, yo) as the center of an 
ellipse; the ellipse major axis and the minor axis as the length and 
width of the fish head region, and the direction of the ellipse major 
axis as the direction of the fish head region. Then the length and 
width of the fish head region can be defined as length = sqX \/T7r, 
width = r x a; the orientation angle as 9= arctan(o:2. Y o/ a L\-o)i an d 
the local contrast of each region as: 



contrast ■■ 



f - y/\AM(x 0 yofo)\ x (1 +'f/r, if k x >0 
[ y/\kM{xo,yo,so)\ x (1 + r) 2 /r, otherwise 



(6) 



Some parameters of the method are described in Figure 3. 

The above method will obtain a plurality of candidate regions of 
head position. Because the fish head's width is usually in a certain 
range, we first use the width constraint to remove the candidate 
regions generated by noise interference. Width threshold value w 
can be specified manually. 

After the width constraint, there is still some false detection of 
candidate regions, which occur mainly in fishtail and fish body. In 
order to remove non-head regions from these candidate regions, 
we first perform image segmentation to identify fish regions from 
image. Image segmentation refers to the process of partitioning an 
image into a set of coherent regions. Since the image mainly 
contains two kinds of gray distribution (fish and background), 
thresholding method can distinguish them. Thresholding method 
is a common image segmentation method. It uses a threshold value 
to turn a grayscale image into a binary image. Here, we use Otsu 
method [20] to obtain fish regions from image. Otsu method is 
considered as a good thresholding method for image segmenta- 
tion. It selects the threshold by minimizing the within-class 
variance of the two groups of pixels separated by the thresholding 
operator. Assuming the best segmentation threshold is t, O) 0 and 
a>\ are the proportions of background pixels and foreground pixels 
in the image, Uq and U\ are the mean grays of background and 
foreground respectively, then the t value can be determined by the 
following equation: 



max[coo(?) x w\(t) x (uo(f) — Wi(f)) 



(7) 



According to the results of segmentation, we use contrast 
constraint and angle constraint to remove non-head regions from 
the candidate regions. 

Contrast constraint: The contrast of the fish head region and 
background is larger compared with the other parts. According to 
the results of equation (6), when contrast >k x (uq/u\), the region 
is considered to be an effective head region, where k is a contrast 
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adjustment parameter. This constraint can effectively remove the 

candidate regions of the fishtail. 

Angle constraint: After the contrast constraint, if there are two 

or more candidate regions cri, cr 2 . . xr n in a segmented region, and 

their corresponding orientation angles and contrasts are 6\, 62- ..0 n 

and C\, c 2 ...c„ respectively. If] 9, —6j\ < 30, \<ij<n, it is 

indicated that there is a phenomenon of duplicate detection. In 

this case, we reserve the candidate region of maximum contrast 

cr, = max (c,) and remove the other candidate regions. The 
i</<« 

reason for the angle setting is that when two fish are in mutual 
occlusion, angle between their head orientations is most likely 
greater than 30 degrees with few situations in which the angle is 
less than 30 degrees. We set that only when the angle is less than 
30 degrees will it be constrained by the angle, ensuring that the 
occluded target will not be missed in most cases. For the small 
probability of missed detections, we will solve the problem by using 
trajectory linking method presented in section 2.3.3. Angle 
constraint can effectively remove the fish body candidate regions, 
ensuring that there is only one candidate region within a certain 
angle in a segmented region. Figure 4 shows an example of the 
candidate constraints. 

2.3 Target tracking 

After the detection of fish head region with the above method, 
we will track every detected targets in the whole video. The 
tracking method is generally described in three parts: motion 
prediction, feature matching and trajectory linking. 

2.3.1 Motion prediction. The fish's motion state is repre- 
sented by a four-dimensional state vector x k = (x,y,v x ,v y ), where x 
and y are the coordinates of the target center (ellipse center), v x 
and v y the speeds in the x direction and y direction. Define the 
observation variable z k = (z x , z y ) to indicate the coordinates of the 
ellipse central after data association. Thus the targets' motion state 
becomes able to be predicted by the Kalman filter [21]. Kalman 
filter is an optimal auto-regression data processing algorithm, 
which estimates a signal' current value according to the previous 
estimated value and the most recent observational data, without all 
of the past observation data. The application of Kalman filter in 
tracking can transform a global search into local search to 
accelerate the tracking speed. In addition, when the target is 
blocked or interfered by other factors (the background noise or 
illumination change), Kalman filter' predicted value can be used to 
replace the best associated target to improve the tracking 
performance. To simplify the model, assume target tracking 
system as a linear discrete system, then the system' state equation 
and observation equation are described as: 



Xk=Fx k -i+w k 



(8) 



Zi = H k x k + v k 



(9) 



_0I 

IS 



where F and H are the target' state transition matrix and 
observation matrix respectively, W\ and v k the noises of state 
variable and observation variable respectively, both assumed to be 
independent and irrelevant zero-mean Gaussian noise. To 
estimate the motion state x k at k, we first predict the current state 
according to the previous estimated state. Because the fish motion 
between adjacent images generally differs slighdy, hence the 
constant velocity model is applied to predict the state at the next 
time: 
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Table 2. Detection performance on different groups. 





Group Size 


Precision 


Recall 


Number of Occlusions 


OR 


ODR 


DT 


Al (10 fish) 


0.998 


0.992 


125 


0.025 


0.912 


1.75 


A2 (20 fish) 


0.990 


0.989 


640 


0.064 


0.889 


1.79 


A3 (40 fish) 


0.971 


0.969 


3040 


0.152 


0.846 


1.88 



doi:1 0.1 371 /journal.pone.01 06506.W02 



'-Fx, 



(10) 



a 




P k =FPk-\F + Q F 



Figure 8. (a) Example of frame image illustrating the detection 
results of the fish head regions for a group of 40 fish; (b) Some 
examples of occlusion events efficiency resolved and a rare 
case where the detection failed. 

doi:1 0.1 371 /journal.pone.01 06506.g008 
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(11) 



where Al is the sampling time interval of two adjacent frames, XZ 
and PZ the model's priori estimation of state variable and error 
covariance at k respectively. £9 represents the covariance matrix of 
the state noise wk. 

After data association, assume XZ 's associated observation 
variable asz^T , update the current state according to the equation 
below: 



Xk - 



+K k (z k -Hx k )H = 



10 0 0 
0 10 0 



Pk=p k 



-K k H k PZ 



(12) 



(13) 



Where K k is the gain parameter at current time, which can be 
expressed as: 



K k = PZ H T (HPZ H T + RY 



R- 



1 0 

0 1 



(14) 



where R is the covariance matrix of the observation noise v k . 

In the above motion prediction, the constant velocity model is 
chosen because of the similarities between fish school' motion and 
the motion it describes, as well as its simple calculation for 
attainability. However, the motion of fish school in some cases is 
random, where constant velocity model cannot deal with. That is, 
the model itself cannot give a complete description of all the 
motions, bound to cause some errors in the subsequent tracking. 
Based on the statistical analysis of the test data, the change of fish 
head motion between two adjacent frames (1/30 second) is 
generally within ±45 degrees. Then we design a compensation 
window to track the targets failing to be predicted according to this 
law. As shown in Figure 5(a), the compensation window prescribes 
the detected ellipse' direction as the target's moving direction at 
current time, the target' possible moving range as a quarter circle 
region with the ellipse center as the center and its long axis as the 
radius. If the prediction fails, the compensation window will be 
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Table 3. Compared methods. 




Number 


Motion Prediction 


Data Association 


Trajectory Linking 


Methodl 


None 


Proposed 


Proposed 


Method2 


Proposed 


Nearest neighbor association [25] 


Proposed 


Method3 


Proposed 


Probabilistic data association [26] 


Proposed 


Method4 


Proposed 


Proposed 


None 


Method5 


Proposed 


Proposed 


Proposed 





doi:1 0.1 371 /journal.pone.01 06506.t003 



a 




■ Methodl 

■ Method2 
Method3 

■ Method4 

■ Method5 



A 1(10 fish) 



A2 (20 fish) 



A3 (40 fish) 



1.4 
1.35 

1.3 
1.25 

1.2 
1.15 

1.1 
1.05 
1 

0.95 
0.9 



nil 



I Methodl 
I Method2 
I Method3 
I Method4 
Method5 



A 1(10 fish) 



A2 (20 fish) 



A3 (40 fish) 



Figure 9. Performance of compared methods on two evaluation metrics, (a) TCF; (b) TFF. As fish density increases, tracking performance of 
all five methods falls. In comparison, the proposed method offers highest TCF values and lowest TFF values, indicating its performance is the best 
among the compared methods. 
doi:1 0.1 371 /journal.pone.01 06506.g009 
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Figure 10. Tracking results on different groups with 16.7 seconds as duration. Left column: trajectory acquisition results with the time axis. 
Right column: trajectory acquisition results without the time axis, (a) A1 (10 fish); (b) A2 (20 fish); (c) A3 (40 fish). 
doi:1 0.1 371 /journal. pone.0106506.g010 



used for data association to make up for the shortage of the 
constant velocity model. 

2.3.2 Feature matching. Data association of state vectors 
and observations is a necessity for multi-target tracking. In order to 
optimize association accuracy, we employ feature matching. The 
key issue of feature matching is to find effective features that reflect 
the similarity among images of the same target and dissimilarity 
between images of different targets. The observation of sampling 
video finds that fish school generally moves horizontally in the 
shallow water, with very little vertical motion that can be ignored. 
That means the head region of the same target in different videos 
remains substantially constant. Based on the observed character- 
istics, we propose to use the width, area and grayscale information 
offish head region for feature matching. 

First, feature calculation. Based on the detection result in 2.2, 
active contour model [22] is used for the extraction of fish head 
contour and its initial region is the detected elliptical region. 
Assume (x,-, Ji) is a random point among the detected contour set, 
set the ellipse center (xoj'o) a s the inner endpoint, search the range 
with a directional angle of 9± 15 degrees for the farthest contour 
point (x a ,y a ) to the central point (Xojo) as the outer endpoint. Then 
draw a line passing (Xojo) an d perpendicular to the line of inner 
and outer endpoints. Set the intersection points of the line and the 
contour as (xjjj) and (x C! y c ), then the line of these two points is the 
cutting line, the closed region enclosed by the cutting line and the 
contour is the target' matching region, and the line of inner and 
outer endpoints and their directions is the alignment line of the 



matching region. Figure 5(b) shows the segmentation model of 
matching region. 

Here we do not directly use the detected ellipse region for 
feature matching but the positional relationship between the 
ellipse center and head contour to redesign a segmentation 
method as matching region. This is because that the ellipse region 
is estimated according to the grayscale adjacent to the central 
point, subject to the change of light in the swimming offish school. 
The proposed segmentation method takes advantage of the high 
stability of the ellipse center and the head contour, as well as active 
contour model' high accuracy in contour description, which 
together contribute to the obtained matching region' strong 
consistency between the adjacent images and therefore gready 
improve the accuracy of feature matching. 

After obtaining the matching region, the next is feature 
matching. Based on the degree of feature discrimination, we use 
cascade method for feature matching, which not only ensures the 
accuracy of matching, but also improves the matching speed. 

Let the cutting line at k-l as Z^-i and matching region MR^.i, 
then the width matching at k is expressed as: 

wm k = | Width(L k )- Width(L k - X )\ (15) 
where, Width is the width of the corresponding cutting line. 
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Area matching is expressed as: 



am k = - 



Area(MR k AMR k _i) 
Area(MR k ) 



(16) 



where, A is the symmetric difference set of the two matching 
regions after the alignment of alignment line, Area the area of the 
corresponding matching region. 
Grayscale matching is expressed as: 



_ ^ ( H MR k (0 - H M R k _ x (0) 
gm ~ V H MSk (i) + H MRkl (i) 



(17) 



where, H is the histogram of the corresponding matching region. 
The final feature matching result is defined as follows: 



result - 



1 , ;/ oo ewm k < a ]v &am k < a a &gm k < a g 
0, otherwise 



(18) 



Figure 6 shows the process of feature matching. If only one 
target' matching region at h matches successfully, then the target is 
the associated observation variable. If more than one targets' 
matching regions at k match successfully, the target with the 
smallest matching result min (wm k -am k -gm k ) is the associated 
observation variable. 

2.3.3 Trajectory linking. The occlusion occurs frequendy in 
swimming fish school, causing some detection errors and the 
failure of the complete matching between adjacent images and 
finally leading to the fragmentation of tracking trajectory. To solve 
this problem, we propose the following approach for trajectory 
handling on the basis of [23]: 

(1) If a state variable of the associated observation variable is 
found, update according to equation (12), and mark the 
state variable effective. 

(2) If no state variable of the associated observation variable is 
found, associate with a virtual observation variable, update 
according to x k = x^ and mark the state variable 
ineffective. If no observation variable on the trajectory is 
associated in 7\ consecutive frames, then the target 
probably keeps still, mark the trajectory incomplete and 
record the time el and position ep of the observation 
variable of the last effective state as the end tag of the 
trajectory. 

(3) If no observation variable of the associated state variable is 
found, we initialize the tracking and record the time st and 
position sp of the observation variable as the start tag of a 
new trajectory. The following tracking will see the two 
situations: A. if the observation variable is caused by error 
detection, it will last only a few frames of time, then we 
remove trajectory with an even less time of duration than 
this one; B. if the observation variable is generated by re- 
emerging target after occlusion, then mark the trajectory 
incomplete. 

After the above process, we begin trajectory linking. Assume p 
is an incomplete trajectory with end tag and P an incomplete 
trajectory with start tag. Define the constraint as below: 



Time constraint : TimeiTiXj)^ 
1 if \<T j (st)-T i (et)<T 2 
0 otherwise 



(19) 



The above equation indicates that, if the initial time of 
trajectory P is later than the end time of trajectory P, and time 
difference is less than T 2 , then the two trajectories meet time 
constraint. 



Space constraint : Space(Ti,Tj) = 

i if \rfcp)-rj(sp)\<D 

0 otherwise 



(20) 



The above equation indicates that, if the distance between the 
end position of p and start position of Tj is less than D, then the 
two trajectories meet space constraint. 

If two trajectories meet both time constraint and space 
constraint at the same time, then they maybe belong to the 
trajectory fragments of one trajectory. Next, we conduct feature 
matching on the observation variables represented by the start tag 
and end tag. If the matching is successful, connect the two 
trajectories. Figure 7 shows an example of trajectory linking. 

Experiments and Discussions 

We have conducted experiments to evaluate the performance of 
the proposed method in tracking multiple swimming fish. The 
experimental apparatus is shown in Figure 1(a). The fish are 2— 
3 cm long, and swim in a square acrylic tank of size 30 cmx30 cm 
filled with water of 3 cm deep. Fish behavior is relatively quiet and 
several motion modes (regular acceleration and deceleration, 
glide-and-burst, rapid and explosive motion) are present in the 
experimental data. A Flare 4M180-CL camera by IO Industries is 
placed above the tank at a distance to capture the entire arena. In 
order to evaluate the proposed method more challengingly, we 
leave the noise at the bottom of fish tank and the disturbance of 
the suspended matter in the water with no special processing. The 
computing facility includes a desktop computer with Intel 15 
2.3 GHz processor, 4G RAM, GF9400 graphics card and Madab 
programming environment. In order to test the tracking perfor- 
mance of different fish schools, we choose zebrafish (Danio rerio) 
with different densities in 3 groups: Al (10 fish), A2 (20 fish), A3 
(40 fish), the video for each fish group contains 500 frames. The 
time resolution of the camera is 30 frames per second and image 
resolution 2048x2040 pixels. The parameter settings are shown in 
Table 1. 

3.1 Evaluation of detection performance 

We first carry out target detection for each frame image in the 
video. In order to quantify the performance of the proposed 
detection method, the precision and recall ratios that are widely 
adopted for evaluating object detection methods are used in the 
experiment. They are defined as follows: 



Precision = 



true positive 



true positive + false positive 



(21) 
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Recall - 



true positive 



true positive -\- false negative 



(22) 



where true positive is the total number of correctly detected 
regions in all frames; false negative is the total number of missed 
regions; false positive is the total number of wrongly detected 
regions. 

In addition, in order to better evaluate the detection perfor- 
mance of the proposed method in the case offish occlusion, we set 
up three additional evaluation criteria: OR (occlusion ratio), ODR 
(occlusion detection ratio) and DT (detection time). 



OR- 



total number of occlusions in all frames 
total number of targets in all frames 



(23) 



and ground truth target position over all frames. The cost of an 
association is defined as the sum of distances between the obtained 
trajectories and the associated ground truth trajectories. According 
to the cost, we can work out an optimal association A* that 
minimizes this cost. 

Define two evaluation metrics to evaluate the tracking method. 
The first is TCF (trajectory completeness factor): 



TCF = 



E t Er^( G<) IQgj.gJI 

E, \G,\ 



(27) 



where A(G,) is the set of obtained trajectories associated with G, in 
A*. It indicates the average ratio of one ground truth trajectory 
length covered by the obtained trajectories. The smaller the value 
is, the little the accuracy is. The second evaluation metric TFF 
(trajectory fragmentation factor) can be defined as: 



ODR-- 



successful number of occlusion detection in all frames 

(24) 

total number of occlusions in all frames 



DT - 



total detection time 
total number of frames 



(25) 



The detection results are as shown in Table 2. From the results 
we can see that with the increase of fish school density, the 
occlusion ratio rises and the detection performance gradually 
declines. The fish school occlusion makes the head region invisible 
and then leads to the detection errors. In spite of this, the Precision 
ratio of the three groups of videos are maintained at over 0.971 
and the Recall ratio are maintained at over 0.969, which fully 
proved the effectiveness of our detection method. Furthermore, 
occlusion detection ratio shows that, although fish occlusion brings 
some difficulties, the proposed detection method still demonstrates 
strong detection ability under occlusion. Because most occlusions 
are caused by fish body or tail rather than head, our method is 
then able to detect most occluded targets. Finally, seen from the 
detection time, the detection time in three groups are all within 
1.9 seconds, and no significant change occurs with the increase of 
detection quantity, which indicates a good time performance in 
detecting target population. Figure 8 shows some detection results. 
As can be seen from Figure 8(b), when the target is occluded but 
the head region is visible, our detection method can detect the 
target's location and direction according to the local information. 

3.2 Evaluation of tracking performance 

After detecting and locating each fish, we then track them 
throughout the video to obtain their motion trajectories. To 
evaluate the proposed tracking method quantitatively, we associate 
the obtained trajectories T with ground truth trajectories G by 
using the approach proposed by [24]. Make 0(Tj,Gi) to indicate 
the frames where Fj and G, overlap, then the distance between 
two trajectories is defined as: 



£(r,,G,)= 



l 



|0(r„G,)l te0 t^ Gi) 



\\ x 't x t\\ 



(26) 



where x t represents the target location on t. The above equation 
indicates the average distance between the obtained target position 



TFF = 



Ej \A(Gj)\ 
\{Gi\A(Gi)*0}\ 



(28) 



It describes the average number of gained trajectories used to 
match one ground truth trajectory. Larger value means worse 
effect of the method in tracking the targets. 

The tracking method consists of three parts: motion prediction, 
data association and trajectory linking. In order to better evaluate 
the proposed method, we use five methods with different schemes 
and compare them, and compared methods are shown in Table 3. 

Figure 9 shows the comparison results of different methods. 
Seen from the comparison of motion prediction, the adding of 
motion prediction performs much better than the single use of 
feature matching, especially in that, the more fish schools are, the 
more searching space for feature matching is needed, which will 
gradually lower the probability of successful matching and then 
lead to more tracking errors. By adding motion prediction into 
feature matching, the matching calculation drops and accuracy 
increases, thus tracking results are significandy improved. Seen 
from the comparison of data association, the tracking result of data 
association with feature matching performs significandy better 
than the nearest neighbor association [25] and the probabilistic 
data association [26], due to that the latter two only take account 
of fish school' motion as the association basis. In actual tracking, 
the motion state of fish school is quite complex, with frequent 
occlusion. The motion information itself can hardly complete 
accurate association, while feature matching takes full advantage 
of the fish school's appearance information and keeps the 
consistency of targets in the complex motion. With the increase 
of fish schools, the tracking performance of all three methods 
declines because of more occlusion. However, the comparison 
finds that, the tracking performance of our tracking method 
declines much more slightly than the other two, showing that the 
proposed tracking method has strong robustness in multi-target 
tracking. Seen from the comparison of trajectory linking, the TCF 
and TFF using trajectory linking method is superior to the 
unconnected, which indicates that the gained trajectories becomes 
more intact after trajectory linking. In addition, with the increase 
of fish schools, the number of gained trajectories after trajectory 
linking also increases, with a more obvious effect in high-density 
population than low-density population, indicating that the 
proposed method can better deal with trajectory fragmentation 
problem caused by occlusion. Acquired trajectories using the 
proposed method in different groups are shown in Figure 10. 
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3.3 Discussions 

In experimenting, we find out that for fish with rapid transition 
of swimming mode, Kalman filter will likely fail to predict a 
reasonable new state. Then we solve this problem by using a 
compensation window and trajectory linking method. In addition, 
we have performed preliminary experiments on images of golden 
shiner, paracherirodom innesi, tadpole and sperm, and results 
show that the proposed method can also detect and track the 
regions of their heads. The performance of the proposed method is 
closely related to the occlusion ratio. When the fish head region is 
occluded by other fish, the coordinate data of the target will be 
lost. The longer the occlusion time, the longer the coordinate data 
are lost. When the similarity between several matching regions is 
very high, feature matching may fail, which will lead to identity 
switch. The higher the density of fish group, the higher the 
probability of the occurrence of this situation. If the ratio between 
BL/LA (mean fish body length/length of arena) decreases, the 
occlusion ratio will go down, thus resulting in that fish can be 
tracked more easily. Conversely, if fish swim in polarized schools, 
it will cause severe occlusions, which will significantly increase 
tracking difficulties. The occlusion problem is the most difficult 
problem in multi-target tracking. Although we have tried to 
overcome it, the detection errors and tracking errors caused by 
occlusion cannot be completely avoided. 

Conclusion 

This paper proposes an effective method for detecting and 
tracking multiple fish swimming in shallow water with frequent 
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